Abstract: The Slepian-Wolf theorem is a well-known framework
that targets almost lossless compression of (two) data streams
with symbol-by-symbol correlation between the outputs of (two)
distributed sources. However, this paper considers a different
scenario which does not fit in the Slepian-Wolf framework. We
consider two identical but spatially separated sources. We wish to
study the universal compression of a sequence of length $n$ from
one of the sources provided that the decoder has access to (i.e.,
has memorized) a sequence of length $m$ from the other source. Such
a scenario occurs, for example, in the universal compression of
data from multiple mirrors of the same server. In this setup, the
correlation does not arise from symbol-by-symbol dependency
of two outputs from the two sources. Instead, the sequences are
correlated through the information that they contain about the
unknown source parameter. We show that the finite-length nature
of the compression problem at hand requires considering a notion
of almost lossless source coding, where coding incurs an error
probability $p_e(n)$ that vanishes with sequence length $n$. We obtain
a lower bound on the average minimax redundancy of almost
lossless codes as a function of the sequence length $n$ and the
permissible error probability $p_e$ when the decoder has a memory
of length $m$ and the encoders do not communicate. Our results
demonstrate that a strict performance loss is incurred when the
two encoders do not communicate even when the decoder knows
the unknown parameter vector (i.e., $m \to \infty$).

I. Introduction 

Many practical applications involve compression of data that
are taken from multiple spatially separated sources. A key
challenge in most such applications is that the sources usually
cannot communicate with each other. Theoretical results
by Slepian and Wolf demonstrate that if the data streams from
two sources have symbol-by-symbol correlation, the sequences
can be compressed to their joint entropy even when the two
encoders do not communicate [1]. In other words, as in Fig. 1,
assume that sources $S_1$ and $S_2$ wish to transmit the sequences
$y^n$ and $x^n$, respectively, to a node $R$. As the length $n$ of the
sequences increases, the decoding of $x^n$ at $R$ with the help
of $y^n$ can be performed using a code with an average length
that asymptotically approaches the conditional entropy (i.e.,
$H(X^n|Y^n)$) with asymptotically zero error probability. If the
decoder did not use $y^n$ in decoding, the encoder at
$S_2$ would have to encode the sequence $x^n$ irrespective of $y^n$
with an average length that is lower bounded by $H(X^n)$. Note
that the conditional entropy $H(X^n|Y^n)$ may be significantly
smaller than the individual entropy $H(X^n)$. After the recent
development of practical Slepian-Wolf (SW) coding schemes
by Pradhan and Ramchandran [2], SW coding has drawn a
great deal of attention as a promising technique for sensor
networks [3] and distributed video coding [4].

Fig. 1. The basic scenario for the compression of distributed sources.

The Slepian-Wolf theorem naturally suits applications where
the (new) sequence $x^n$ from $S_2$ (in Fig. 1) can be viewed
as a noisy version of the (previously seen) sequence $y^m$ that
could possibly be exploited as side information to reduce the
code length of $x^n$. Data gathering from sensors that measure
the same phenomenon is one example. However, in many
scenarios, the compression of distributed sources cannot be
modeled by the SW framework. As an example, consider
the universal compression of data from the mirrors of the
same server, where the sources are exact copies of each other.
Hence, it is plausible to assume that the sources ($S_1$ and $S_2$
in Fig. 1) follow the same statistical model. On the other
hand, the source model might be unknown, requiring universal
compression [5]-[7]. The question is, assuming two identical
sources $S_1$ and $S_2$ and having $y^m$ from $S_1$ at the decoder,
what is the achievable universal compression performance on
$x^n$ at $S_2$ provided that the encoders at $S_1$ and $S_2$ do not
communicate?

We stress that the nature of this problem is fundamentally
different from those addressed by the Slepian-Wolf (SW)
theorem in [1]. Here, instead of symbol-by-symbol correlation
between the sequences as in the SW setup, the redundancy is due
to the fact that when the source parameter is a priori unknown
there is significant overhead in the universal compression of
finite-length sequences [8]-[9]. Considering the example in
Fig. 1 with two identical sources $S_1$ and $S_2$, $y^m$ and $x^n$
would be independent given that the source model is known.
However, when the source parameter is unknown, $y^m$ and $x^n$
are correlated with each other through the information they
contain about the unknown source parameter. The question is
whether or not this correlation can be leveraged by
the encoder of $S_2$ and the decoder at $R$ in the decoding of $x^n$
using $y^m$ in order to reduce the code length of $x^n$.
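To make this last point concrete, the following sketch computes the mutual information $I(X^n; Y^m)$ for a binary memoryless source whose parameter is drawn from a uniform prior. The binary alphabet, the uniform prior, and all function names are illustrative assumptions and are not part of the setup of this paper. Since the numbers of ones are sufficient statistics, the mutual information between the two sequences equals that between the two counts, which follow a beta-binomial law.

```python
import math

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_comb(n, k):
    # log of the binomial coefficient C(n, k)
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def count_pmf(n):
    """P(K = k) for the number of ones in n symbols, theta ~ Uniform(0, 1)."""
    return [math.exp(log_comb(n, k) + log_beta(k + 1, n - k + 1))
            for k in range(n + 1)]

def joint_count_pmf(n, m):
    """P(K_x = i, K_y = j) when both sequences share the same unknown theta."""
    return [[math.exp(log_comb(n, i) + log_comb(m, j)
                      + log_beta(i + j + 1, n + m - i - j + 1))
             for j in range(m + 1)]
            for i in range(n + 1)]

def mutual_information_bits(n, m):
    """I(X^n; Y^m) in bits, computed through the sufficient statistics."""
    px, py, pxy = count_pmf(n), count_pmf(m), joint_count_pmf(n, m)
    total = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            p = pxy[i][j]
            if p > 0.0:
                total += p * math.log2(p / (px[i] * py[j]))
    return total
```

If the parameter were known, the same computation would give zero; with an unknown parameter the mutual information is strictly positive and grows with $m$, which is the correlation that a decoder holding $y^m$ may try to exploit.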

In this paper, we study the universal compression of distributed
identical sources. By identical we mean that the
sources ($S_1$ and $S_2$) share the same unknown source parameter.
By distributed we mean that the sources are spatially separated
and the encoders do not communicate with each other. This
problem can also be viewed as universal compression with
training data that is only available to the decoder. It is known
that forming a statistical model from a training data set would
improve the performance of universal compression [10], [11].
In [9], [12], we theoretically derived the gain that is obtained
in the universal compression of the new sequence $x^n$ from $S_2$
by memorizing (i.e., having access to) $y^m$ from $S_1$ at both
the decoder (at $R$) and the encoder (at $S_2$). This corresponds
to the reduced case of our problem where the sources $S_1$
and $S_2$ are either co-located (a single source) or allowed to
communicate. For the reduced problem case, in [11], [13], we
further extended the setup to a network with a single source
and derived bounds on the network-wide gain where a small
fraction of the intermediate nodes in the network are capable
of memorization. However, as we demonstrate in the present
paper, the extension to multiple spatially separated sources,
where the training data is only available to the decoder, is
non-trivial and raises a new set of challenges that we aim
to address.

The rest of this paper is organized as follows.
In Sec. II, we briefly review the necessary background. In
Sec. III, we describe the problem setup. In Sec. IV, we present
our main results. In Sec. V, we provide discussion on the
results. In Sec. VI, we present the technical analysis of the
results. Finally, Sec. VII concludes the paper.

II. Background Review 

In this section, we review the necessary background, notations,
and definitions followed by the formal problem setup.
Following the notation in [12], let $\mathcal{A}$ be a finite alphabet.
Let $d$ be the number of the source parameters. Further, let
$\theta = (\theta_1, \ldots, \theta_d)$ denote the $d$-dimensional parameter vector
associated with the parametric source (that is a priori unknown).
Let $\Lambda^d$ denote the space of $d$-dimensional parameter vectors.
We denote $\mu_\theta$ as the probability measure that is defined by
the parameter vector $\theta$. Let $\mathcal{P}^d$ denote the family of sources
that are described with the $d$-dimensional unknown parameter
vector $\theta \in \Lambda^d$. We use the notation $x^n = (x_1, \ldots, x_n) \in \mathcal{A}^n$ to
present a sequence of length $n$ from the alphabet $\mathcal{A}$. We further
denote $X^n$ as a random sequence of length $n$ (that follows the
probability distribution $\mu_\theta$). Let $H_n(\theta)$ be the source entropy
given $\theta$, i.e., $H_n(\theta) = \mathbf{E} \log\left(\frac{1}{\mu_\theta(X^n)}\right)$.¹

Let $c_n : \mathcal{A}^n \to \{0,1\}^*$ be an injective mapping from
the set $\mathcal{A}^n$ of the sequences of length $n$ over $\mathcal{A}$ to the set
$\{0,1\}^*$ of binary sequences. Next, we present the notions of
strictly lossless and almost lossless source codes, which will
be needed for the study of UC-DIS.

Definition 1 The code $c_n(\cdot) : \mathcal{A}^n \to \{0,1\}^*$ is called
strictly lossless (also called zero-error) if there exists a reverse
mapping $d_n(\cdot) : \{0,1\}^* \to \mathcal{A}^n$ such that

$$ \forall x^n \in \mathcal{A}^n : \quad d_n(c_n(x^n)) = x^n. $$

¹ Throughout this paper, all expectations are taken with respect to the
probability measure $\mu_\theta$, and $\log(\cdot)$ denotes the logarithm in base 2.

Practical data compression schemes, such as arithmetic coding,
Huffman coding, Lempel-Ziv, and Context-Tree-Weighting
algorithms, are examples of strictly lossless codes.

On the other hand, due to the distributed nature of the 
sources, we are concerned with the slightly weaker notion of 
almost lossless source coding in this paper. 

Definition 2 The code $c^{p_e}_n(\cdot) : \mathcal{A}^n \to \{0,1\}^*$ is called almost
lossless with permissible error probability $p_e(n) = o(1)$, if
there exists a reverse mapping $d^{p_e}_n(\cdot) : \{0,1\}^* \to \mathcal{A}^n$ such
that

$$ \mathbf{E}\{\mathbf{1}_e(X^n)\} \le p_e(n), $$

where $\mathbf{1}_e(x^n)$ denotes the error indicator function, i.e.,

$$ \mathbf{1}_e(x^n) = \begin{cases} 1 & \text{if } d^{p_e}_n(c^{p_e}_n(x^n)) \ne x^n, \\ 0 & \text{otherwise}. \end{cases} $$

The almost lossless codes allow a non-zero error probability
$p_e(n)$ for any $n$ while they are almost surely asymptotically
error free. Note that strictly lossless codes correspond to
$p_e(n) = 0$. The proofs of Shannon [14] for the existence of
entropy achieving source codes are based on almost lossless
random codes. Further, the proof of the SW theorem [1]
also uses almost lossless codes. Moreover, all of the practical
implementations of SW source coding are based on almost
lossless codes (cf. J2), Q). We stress that the nature of the 
almost lossless source coding is different from that of
lossy source coding (i.e., the rate-distortion theory). In
the rate-distortion theory, a code is designed to asymptotically
achieve a given distortion level as the length of the sequence
grows to infinity. Therefore, since almost lossless coding
asymptotically achieves zero distortion, it coincides
with the special case of zero distortion on the rate-distortion
curve.
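As a toy illustration of Definition 2, the sketch below builds a fixed-length code for a binary memoryless source with a known parameter. The code only indexes sequences whose number of ones lies in a window around $n\theta$, and the sketch evaluates the resulting error probability exactly. The window-based construction, the known parameter, and all names are assumptions made for illustration; this is not the coding scheme analyzed in this paper.

```python
import math

def binom_pmf(n, k, theta):
    # probability that a sequence of n symbols contains exactly k ones
    return math.comb(n, k) * theta**k * (1.0 - theta)**(n - k)

def windowed_code(n, theta, half_width):
    """Index only sequences whose count of ones is within half_width of n*theta.

    Returns (bits, p_error): the fixed codeword length in bits and the
    probability that the source emits a sequence outside the indexed set,
    in which case the decoder cannot recover it.
    """
    center = round(n * theta)
    counts = [k for k in range(n + 1) if abs(k - center) <= half_width]
    num_sequences = sum(math.comb(n, k) for k in counts)
    bits = math.ceil(math.log2(num_sequences))
    p_error = 1.0 - sum(binom_pmf(n, k, theta) for k in counts)
    return bits, p_error
```

Widening the window drives the error probability toward zero at the cost of longer codewords; indexing all $2^n$ sequences recovers a strictly lossless code.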

III. Problem Setup 

We present the problem setup in the most basic scenario,
shown in Fig. 1, consisting of two identical sources located
in nodes $S_1$ and $S_2$, and the destination node $R$. We let
the information sources at $S_1$ and $S_2$ be parametric with an
identical $d$-dimensional parameter vector $\theta$ that is unknown a
priori to the encoder and the decoder. Let $y^m$ and $x^n$ denote
two sequences with lengths $m$ and $n$, respectively, that are
generated by the unknown information source model. In the
sequel, we describe the communication scenario for universal
compression of distributed identical sources. We assume that
$S_1$ has transmitted the sequence $y^m$ to $R$. Next, at some later
time, $S_2$ wishes to send $x^n$ to $R$. We further assume that $R$
is a memory unit and is capable of memorizing the sequence
$y^m$. We investigate the achievable saving in the compression of
$x^n$ in the $S_2$-$R$ link when $R$ has memorized the sequence $y^m$.
Note that $S_2$ does not have access to the sequence $y^m$. If the
node $R$ did not have a memory unit, $S_2$ would have to apply
an end-to-end universal compression to $x^n$. However, the side
information provided by $y^m$ at $R$ about the source parameter
can potentially result in a reduction in the number of bits
required to be transmitted in the $S_2$-$R$ link. Throughout the
paper, we refer to this problem setup as Universal Compression
of Distributed Identical Sources (UC-DIS).

In the study of coding strategies for UC-DIS, we compare
the following cases for the compression of $x^n$ at $S_2$. Note that
we assume that $y^m$ is already universally compressed at $S_1$
and transmitted and decoded at $R$.

• UComp (Universal compression), which only applies
end-to-end lossless universal compression to $x^n$ at $S_2$
without regard to $y^m$.

• UCompM (Universal compression with memorization at
both the encoder and the decoder), which assumes that
the encoder (at $S_2$) and the decoder (at $R$) have access to
a common memory (i.e., sequence $y^m$), which is utilized
in the lossless compression of $x^n$ at $S_2$.

• DUCompM (Distributed universal compression with
memorization at the decoder), which assumes that the
decoder (at $R$) has memorized (i.e., has access to) $y^m$ while
the encoder (at $S_2$) only knows the length $m$ of the side
information but does not know the exact sequence $y^m$.
The encoder then applies an almost lossless code to $x^n$
that is decoded at $R$ with permissible error probability $p_e$
using $y^m$.

Note that UComp does not benefit from the memorization and
is the conventional scheme. Further, UCompM is introduced as
the benchmark for the purpose of evaluating the performance
of DUCompM and is not practically useful since it requires
the sequence $y^m$ from $S_1$ to be available at the encoder of $S_2$.

Let $l_n(x^n)$ denote the strictly lossless length of the codeword
associated with the sequence $x^n$. Further, let $L_n$ denote
the space of strictly lossless universal length functions on a
sequence of length $n$. Denote $R_n(l_n, \theta)$ as the expected
redundancy of such strictly lossless codes on a sequence of length $n$
for the parameter vector $\theta$, i.e., $R_n(l_n, \theta) = \mathbf{E}\, l_n(X^n) - H_n(\theta)$.
Further, denote $R_{\mathrm{UComp}}(n)$ as the average minimax redundancy
as given by

$$ R_{\mathrm{UComp}}(n) = \min_{l_n \in L_n} \sup_{\theta \in \Lambda^d} R_n(l_n, \theta). \tag{1} $$



In UCompM, let $l_{n|m}$ be the strictly lossless universal
length function with a memory sequence of length $m$. Denote
$L_{n|m}$ as the space of such strictly lossless universal length
functions. Let $R_n(l_{n|m}, \theta)$ be the expected redundancy of
encoding a sequence of length $n$ from the source $\mu_\theta$ using
the length function $l_{n|m}$. Further, let $R_{\mathrm{UCompM}}(n, m)$ denote
the corresponding average minimax redundancy, i.e.,

$$ R_{\mathrm{UCompM}}(n, m) = \min_{l_{n|m} \in L_{n|m}} \sup_{\theta \in \Lambda^d} R_n(l_{n|m}, \theta). \tag{2} $$



In DUCompM, let $l^{p_e}_{n|m}$ denote the almost lossless universal
length function with a memorized sequence of length $m$ that is
only available to the decoder, where the permissible error
probability on decoding $x^n$ is $p_e$. Further, denote $L^{p_e}_{n|m}$ as the space
of such universal length functions. Denote $R_n(l^{p_e}_{n|m}, \theta)$ as the
expected redundancy of encoding a sequence $x^n$ of length $n$
using the length function $l^{p_e}_{n|m}$. Denote $R^{p_e}_{\mathrm{DUCompM}}(n, m)$ as the
expected minimax redundancy as given by

$$ R^{p_e}_{\mathrm{DUCompM}}(n, m) = \min_{l^{p_e}_{n|m} \in L^{p_e}_{n|m}} \sup_{\theta \in \Lambda^d} R_n(l^{p_e}_{n|m}, \theta). \tag{3} $$

Note that we denote $R_{\mathrm{DUCompM}}(n, m) = R^{0}_{\mathrm{DUCompM}}(n, m)$
as the expected minimax redundancy of the strictly lossless
DUCompM coding strategy.

IV. Performance Evaluation of UC-DIS: 
Results on the Average Minimax Redundancy 

In this section, we provide results on the average minimax
redundancy of the different coding strategies introduced in the
previous section for the UC-DIS problem. Discussion on the
implications of the results and the proof sketches are deferred
to Sec. V and Sec. VI, respectively.

In the case of strictly lossless UComp, Clarke and Barron
derived the expected minimax redundancy $R_{\mathrm{UComp}}(n)$ for
memoryless sources [15], which was later generalized by
Atteson to Markov sources [16], as follows:

Theorem 1 The average minimax redundancy of the strictly
lossless UComp coding strategy is given by

$$ R_{\mathrm{UComp}}(n) = \frac{d}{2} \log\left(\frac{n}{2\pi e}\right) + \log \int |\mathcal{I}_n(\theta)|^{\frac{1}{2}} d\theta + O(\cdot), $$

where $\mathcal{I}_n(\theta)$ is the Fisher information matrix.
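For intuition on the magnitude of the two leading terms in Theorem 1, the following sketch evaluates them for a memoryless source over an alphabet of size $k$, for which $d = k - 1$. It relies on the standard closed form $\int |\mathcal{I}(\theta)|^{1/2} d\theta = \Gamma(1/2)^k / \Gamma(k/2)$ for the multinomial family with the Fisher information measured in nats; that closed form is brought in here as an assumption, since this paper does not state it. The lower-order term is dropped, and the function name is illustrative.

```python
import math

def ucomp_redundancy_bits(n, k):
    """Leading terms of Theorem 1 for a memoryless source with alphabet size k.

    Assumes the Jeffreys normalizer of the multinomial family,
    Gamma(1/2)**k / Gamma(k/2), and ignores the lower-order term.
    """
    d = k - 1
    # log2 of the integral of sqrt(det Fisher information)
    log2_integral = (k * math.lgamma(0.5) - math.lgamma(k / 2.0)) / math.log(2.0)
    return 0.5 * d * math.log2(n / (2.0 * math.pi * math.e)) + log2_integral
```

The expression is asymptotic in $n$, so values for sequences that are short relative to $d$ should be read as indicative only.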

In the case of strictly lossless UCompM (i.e., when the two 
encoders can communicate), we obtain the average minimax 
redundancy in the following theorem. 

Theorem 2 The average minimax redundancy of the strictly
lossless UCompM coding strategy is given by

$$ R_{\mathrm{UCompM}}(n, m) = \frac{d}{2} \log\left(1 + \frac{n}{m}\right) + O(\cdot). $$
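A short numerical sketch of the leading term $\frac{d}{2}\log(1 + n/m)$ of Theorem 2 is given below; the lower-order term is ignored and the function name is an illustrative choice.

```python
import math

def ucompm_redundancy_bits(n, m, d):
    """Leading term (d/2) * log2(1 + n/m) of the UCompM redundancy."""
    return 0.5 * d * math.log2(1.0 + n / m)

# The redundancy shrinks as the memorized sequence grows relative to n.
for m in (10**3, 10**5, 10**7):
    print(m, ucompm_redundancy_bits(1000, m, d=4))
```

When $m = n$ the leading term equals $d/2$ bits, and it vanishes as $m$ grows with $n$ fixed.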



In the next proposition, we confine ourselves to strictly 
lossless codes in the DUCompM strategy. 

Proposition 3 The average minimax redundancy of the strictly
lossless DUCompM coding strategy is equal to that of the UComp
coding strategy, that is, $R_{\mathrm{DUCompM}}(n, m) = R_{\mathrm{UComp}}(n)$.

Finally, in the case of almost lossless DUCompM, our main 
result is given in the following theorem. 



Theorem 4 The average minimax redundancy of the almost
lossless DUCompM coding strategy is upper bounded by

$$ R^{p_e}_{\mathrm{DUCompM}}(n, m) \le R_{\mathrm{UCompM}}(n, m) + \mathcal{F}(d, p_e) + O(\cdot), $$

where $\mathcal{F}(d, p_e)$ is the penalty term due to the encoders not
communicating, which is given by

$$ \mathcal{F}(d, p_e) = \frac{d}{2} \log\left(1 + \frac{2}{d \log e} \log\frac{1}{p_e}\right). \tag{4} $$



V. Discussion on the Results 

In this section, we provide some discussion on the significance
of the results for different UC-DIS coding strategies.
Figures 2 and 3 demonstrate the redundancy rate for the three
coding strategies, namely, UComp, UCompM, and DUCompM,
for memoryless sources and first-order Markov sources with
alphabet size $k = 256$, respectively. In the case of UComp,
Theorem 1 gives the achievable average minimax redundancy
for the compression of a sequence of length $n$ encoded
without regard to the previously seen sequence $y^m$.

According to Theorem 2, if the encoder and the decoder
have access to a common memory $y^m$, i.e., the UCompM coding
strategy, the average minimax redundancy could be much
smaller than that of UComp depending on how large $m$ is. In
particular, when $m \to \infty$ we have $\lim_{m \to \infty} R_{\mathrm{UCompM}}(n, m) = 0$.
This corresponds to the case where the parameter vector
is known to both the encoder and the decoder, and thus, the
redundancy is zero, similar to a perfect Shannon code. Hence,
the fundamental limits are those of known source parameters
and universality no longer imposes a compression overhead.
This is also demonstrated in Figs. 2 and 3, where $m$ has been
chosen to be sufficiently large.

Proposition 3 demonstrates that if the strictly lossless
DUCompM coding strategy (i.e., $p_e = 0$) is to be used for the
compression of $x^n$ from $S_2$, the memorization of $y^m$ from
$S_1$ only at the decoder does not provide any compression
benefit, assuming that the two encoders at $S_1$ and $S_2$ do not
communicate. In other words, the best that $S_2$ can do is to
simply apply a traditional universal compression on $x^n$.

Finally, according to Theorem 4, unlike the asymptotic
behavior of the Slepian-Wolf problem, the distributed nature of
this problem incurs an extra redundancy on the compression.
As can be seen in Fig. 2, the overhead can be significant in
the compression of memoryless sources. For example, when
$n = 512$B, $m = 32$kB, and $p_e = 10^{-6}$, the redundancy rate is
around 0.05, as compared with the almost zero redundancy rate
of UCompM. On the other hand, as demonstrated in Fig. 3,
when $d$ is relatively larger, for medium length sequences even
with extremely small error probability, DUCompM performs
fairly close to UCompM. Further, DUCompM by far outperforms
UComp in the compression of short to medium
length sequences with reasonable permissible error probability,
justifying the usefulness of DUCompM in practice. If $\log\frac{1}{p_e} \ll d$,
the penalty term can be further simplified to be approximately
equal to $\mathcal{F}(d, p_e) \approx \log\frac{1}{p_e}$ for the practical ranges of $p_e$.
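The penalty term and its small-$p_e$ approximation can be checked numerically with the sketch below. It evaluates (4) and the leading term of Theorem 2 for the memoryless example above, under the assumptions that each symbol is one byte (so $n = 512$ and $m = 32768$ symbols), that $d = k - 1 = 255$, and that the lower-order terms are ignored. Function names are illustrative.

```python
import math

def penalty_bits(d, p_e):
    """F(d, p_e) = (d/2) log2(1 + (2 / (d log2 e)) log2(1/p_e)), as in (4)."""
    log2_e = math.log2(math.e)
    return 0.5 * d * math.log2(1.0 + 2.0 * math.log2(1.0 / p_e) / (d * log2_e))

def ucompm_bits(n, m, d):
    """Leading term of the UCompM redundancy from Theorem 2."""
    return 0.5 * d * math.log2(1.0 + n / m)

d = 255                    # memoryless source with alphabet size k = 256
n, m = 512, 32 * 1024      # lengths in symbols (assumed: one byte per symbol)
p_e = 1e-6

overhead = penalty_bits(d, p_e)
bound = ucompm_bits(n, m, d) + overhead
print("penalty F(d, p_e) in bits:", overhead)
print("approximation log2(1/p_e):", math.log2(1.0 / p_e))
print("bound per source symbol:", bound / n)
```

Because $\log(1 + x) \le x \log e$, the approximation $\log\frac{1}{p_e}$ never underestimates the penalty, and the two agree closely once $d$ is much larger than $\log\frac{1}{p_e}$.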

VI. Technical Analysis 

A. Sketch of the Proof of Theorem 2

We prove that the RHS is both an upper bound and a lower
bound for $R_{\mathrm{UCompM}}(n, m)$. The upper bound is obtained using
the KT-estimator [19] along with a proper Shannon code [14],
and the proof follows the analysis of the redundancy of the
KT-estimator. In the next lemma, we obtain the lower bound.

² In this paper, we ignored the integer constraint on the length functions,
which results in a negligible $O(1)$ redundancy analyzed in [17], [18].

Fig. 2. The redundancy rate for the three coding strategies of interest
for the UC-DIS problem. Memory size is $m = 32$kB and the source
is memoryless with alphabet size $k = 256$.

Fig. 3. The redundancy rate for the three coding strategies of interest
for the UC-DIS problem. Memory size is $m = 16$MB and the source
is first-order Markov with alphabet size $k = 256$.

Lemma 1 The average minimax redundancy of UCompM is
lower-bounded by

$$ R_{\mathrm{UCompM}}(n, m) \ge \frac{d}{2} \log\left(1 + \frac{n}{m}\right) + O(\cdot). $$



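Proof: It can be shown that the minimax redundancy is
equal to the capacity of the channel between the unknown
parameter vector $\theta$ and the sequence $x^n$ given the sequence
$y^m$ (cf. [8] and the references therein). Thus,

$$
\begin{aligned}
R_{\mathrm{UCompM}}(n, m) &= \sup I(X^n; \theta \,|\, Y^m) \\
&= \sup \{ I(X^n, Y^m; \theta) - I(Y^m; \theta) \} \\
&\ge \{ I(X^n, Y^m; \theta) - I(Y^m; \theta) \} \big|_{\theta \sim w_J(\theta)} \\
&= R_{\mathrm{UComp}}(n + m) - R_{\mathrm{UComp}}(m),
\end{aligned}
\tag{5}
$$

where $w_J(\theta) = \frac{|\mathcal{I}_n(\theta)|^{1/2}}{\int |\mathcal{I}_n(\lambda)|^{1/2} d\lambda}$ denotes Jeffreys' prior, and
$R_{\mathrm{UComp}}(\cdot)$ is given in Theorem 1. Further simplification of (5)
leads to the desired result in Lemma 1. ■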
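For completeness, the simplification of (5) can be sketched as follows. The sketch uses only the two leading terms of Theorem 1 and treats the Fisher information integral as identical for the lengths $n + m$ and $m$, which is an assumption made here (it holds when the normalized Fisher information does not depend on the sequence length); the lower-order terms are omitted.

```latex
\begin{align*}
R_{\mathrm{UComp}}(n+m) - R_{\mathrm{UComp}}(m)
  &\approx \frac{d}{2}\log\frac{n+m}{2\pi e} - \frac{d}{2}\log\frac{m}{2\pi e} \\
  &= \frac{d}{2}\log\frac{n+m}{m}
   = \frac{d}{2}\log\left(1 + \frac{n}{m}\right).
\end{align*}
```

The $\log \int |\mathcal{I}_n(\theta)|^{1/2} d\theta$ terms cancel in the difference, which is why the bound depends on the source family only through $d$.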

B. Sketch of the Proof of Proposition 3

Since the source is assumed to be from the family $\mathcal{P}^d$
of $d$-dimensional parametric sources, in particular, it is also
an ergodic source. Thus, any pair $(x^n, y^m)$ occurs with
non-zero probability and the support set of $(x^n, y^m)$ is equal to
$\mathcal{A}^n \times \mathcal{A}^m$. Therefore, Proposition 3 trivially follows from the
known results on strictly lossless compression (cf. [20] and
the references therein).

C. Sketch of the Proof of Theorem 4

We provide a constructive optimal coding strategy at the
encoder and obtain its achievable average minimax redundancy,
which provides an upper bound on the average minimax
redundancy of the almost lossless DUCompM coding strategy.

Let $\hat\theta(x^n)$ (or $\hat\theta(y^m)$) denote the Maximum Likelihood
(ML) estimate for the unknown source parameter given
that the sequence $x^n$ (or $y^m$) is observed, i.e., $\hat\theta(x^n) =
\arg\max_{\lambda} \mu_\lambda(x^n)$. Further, let $\hat\theta_X = \hat\theta(x^n)$ and $\hat\theta_Y = \hat\theta(y^m)$.
As discussed earlier, $\mu_\theta(x^n)$ is the probability distribution
induced by the parameter vector $\theta$ on the sequence $x^n$. It is
straightforward to derive the pmf of the ML estimate $p(\hat\theta_X|\theta)$
from $\mu_\theta(x^n)$ by summing over all the sequences that
correspond to the same ML estimate. Note that $\hat\theta_X$ follows a
discrete distribution only taking values on a finite set of $(n + 1)^d$
points in the space $\Lambda^d$. For any $\lambda, \theta \in \Lambda^d$, let $D_n(\mu_\lambda \| \mu_\theta)$
be the KL-divergence, i.e., $D_n(\mu_\lambda \| \mu_\theta) = \mathbf{E} \log\left(\frac{\mu_\lambda(X^n)}{\mu_\theta(X^n)}\right)$,
where the expectation here is taken with respect to $\mu_\lambda$.

It can be shown that expectations with respect to $p(\hat\theta_X|\theta)$
can be performed using a continuous RV $\hat\theta_X$ (with uniformly
vanishing error) whose distribution conditioned on $\theta$ is given
by

$$ p(\hat\theta_X|\theta) = |\mathcal{I}_n(\hat\theta_X)|^{\frac{1}{2}} \left(\frac{n}{2\pi}\right)^{\frac{d}{2}} \exp\left(-D_n(\mu_{\hat\theta_X} \| \mu_\theta)\right), \tag{6} $$

where $n$ has to be large enough so that Stirling's approximation
can be applied. Further, it is straightforward to show that
this distribution can be approximated using a Gaussian
distribution with mean $\theta$ and inverse covariance matrix $n\mathcal{I}_n(\theta)$.
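The quality of this Gaussian approximation can be examined numerically in the simplest case. The sketch below assumes a binary memoryless source, for which the ML estimate is the fraction of ones and the per-symbol Fisher information in nats is $1/(\theta(1-\theta))$; it compares the exact distribution of the ML estimate with a Gaussian of mean $\theta$ and inverse variance $n\mathcal{I}(\theta)$. The binary source, the use of a half-step continuity correction, and the function names are assumptions made for illustration.

```python
import math

def normal_cdf(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_cdf_gap(n, theta):
    """Largest gap between the exact CDF of the ML estimate K/n and the
    Gaussian CDF with mean theta and inverse variance n * I(theta)."""
    fisher = 1.0 / (theta * (1.0 - theta))   # per-symbol Fisher information (nats)
    std = 1.0 / math.sqrt(n * fisher)
    gap, cdf = 0.0, 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * theta**k * (1.0 - theta)**(n - k)
        # evaluate the Gaussian half a step beyond k/n (continuity correction)
        approx = normal_cdf(((k + 0.5) / n - theta) / std)
        gap = max(gap, abs(cdf - approx))
    return gap
```

The gap shrinks as $n$ grows, in line with the requirement above that $n$ be large enough for the approximation to apply.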

Next, we will obtain an approximation for the distribution
of $\hat\theta_X$ conditioned on $\hat\theta_Y$.

Lemma 2 Let $\hat\theta_X$ and $\hat\theta_Y$ denote the ML estimates of the
parameter given the observed sequences $x^n$ and $y^m$, respectively. Further, let
$\tilde{p}(\hat\theta_X|\hat\theta_Y)$ follow a Gaussian distribution with mean $\hat\theta_Y$ and
inverse covariance matrix $\frac{nm}{n+m}\mathcal{I}_m(\hat\theta_Y)$. Then, all expectations
with respect to $p(\hat\theta_X|\hat\theta_Y)$ can be performed using $\tilde{p}(\hat\theta_X|\hat\theta_Y)$
with uniformly vanishing error.
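A heuristic way to see how the sequence lengths enter the covariance in Lemma 2 is sketched below; it is not the proof of the lemma. Given $\theta$, the two ML estimates are independent and, by the Gaussian approximation discussed above, approximately Gaussian around $\theta$. The sketch assumes a common per-symbol Fisher information matrix $\mathcal{I}(\theta)$ for both sequence lengths, so that the inverse covariances are $n\mathcal{I}(\theta)$ and $m\mathcal{I}(\theta)$.

```latex
\begin{align*}
\operatorname{Cov}\big(\hat\theta_X - \hat\theta_Y \,\big|\, \theta\big)
  &\approx \frac{1}{n}\,\mathcal{I}(\theta)^{-1} + \frac{1}{m}\,\mathcal{I}(\theta)^{-1}
   = \frac{n+m}{nm}\,\mathcal{I}(\theta)^{-1}, \\
\operatorname{Cov}\big(\hat\theta_X - \hat\theta_Y \,\big|\, \theta\big)^{-1}
  &\approx \frac{nm}{n+m}\,\mathcal{I}(\theta).
\end{align*}
```

Replacing $\theta$ by its estimate $\hat\theta_Y$ then suggests a Gaussian centered at $\hat\theta_Y$ whose inverse covariance scales as $\frac{nm}{n+m}$. This factor tends to $n$ as $m \to \infty$, which recovers the single-sequence approximation with a known parameter.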

Now, we are equipped to define $S_n(y^m, p_e)$ as the set with
smallest Lebesgue volume such that

$$ \int_{S_n(y^m, p_e)} p(\hat\theta_X|\hat\theta_Y) \, d\hat\theta_X \ge 1 - p_e. \tag{7} $$

The following lemma shows how $S_n(y^m, p_e)$ is determined.

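Lemma 3 Let $\hat\theta_Y$ denote the ML estimate for the unknown
parameter vector given that the sequence $y^m$ is observed. Then,
$S_n(y^m, p_e)$ is given by

$$ S_n(y^m, p_e) = \left\{ \phi : \tau (\phi - \hat\theta_Y)^T \mathcal{I}_m(\hat\theta_Y) (\phi - \hat\theta_Y) < \delta_d(p_e) \right\}, $$

where $\tau$ is the scalar that multiplies $\mathcal{I}_m(\hat\theta_Y)$ in the inverse
covariance matrix of Lemma 2, and $\delta_d(p_e)$ satisfies
$\Gamma\left(\frac{d}{2}, \delta_d(p_e)\right) = p_e \, \Gamma\left(\frac{d}{2}\right)$.³

³ $\Gamma(s, x) = \int_x^\infty t^{s-1} e^{-t} dt$ denotes the incomplete Gamma function.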
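The threshold $\delta_d(p_e)$ has no closed form in general, but the defining equation $\Gamma(d/2, \delta_d(p_e)) = p_e \Gamma(d/2)$ is easy to solve numerically. The sketch below is one possible way to do so using only the standard library: it evaluates the regularized incomplete Gamma function through its power series and then bisects. The series-plus-bisection approach and the function names are choices made here for illustration, and the series is only intended for moderate arguments.

```python
import math

def regularized_upper_gamma(a, x):
    """Q(a, x) = Gamma(a, x) / Gamma(a), computed as 1 - P(a, x) using the
    power series of the regularized lower incomplete Gamma function."""
    if x <= 0.0:
        return 1.0
    term, total, k = 1.0, 1.0, 1
    while term > 1e-17 * total:
        term *= x / (a + k)
        total += term
        k += 1
    log_prefactor = a * math.log(x) - x - math.lgamma(a + 1.0)
    return 1.0 - math.exp(log_prefactor) * total

def delta_d(d, p_e):
    """Solve Gamma(d/2, delta) = p_e * Gamma(d/2) for delta by bisection."""
    a = d / 2.0
    lo, hi = 0.0, a + 1.0
    while regularized_upper_gamma(a, hi) > p_e:   # bracket the root
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if regularized_upper_gamma(a, mid) > p_e:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For $d = 2$ the equation reduces to $e^{-\delta} = p_e$, so the solver can be checked against $\delta_2(p_e) = \ln\frac{1}{p_e}$.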



The next lemma determines the probability measure of the set
$S_n(y^m, p_e)$ under Jeffreys' prior.

Lemma 4 Assume that the parameter vector $\theta$ follows Jeffreys'
prior. Then, the probability measure $P_S(p_e)$ of the set
$S_n(y^m, p_e)$ is given by

Ps( Pe )=f °\ 

wkerer = ^L. and C d = 0t j . 

Next, consider the following coding scheme. Let the parameter space
be partitioned into ellipsoids of the form $S_n(y^m, p_e)$. Then,
each sequence is encoded within its respective ellipsoid without
regard to the rest of the parameter space. The decoder
chooses the decoding ellipsoid using the ML estimate $\hat\theta_Y$ and
the permissible decoding error probability $p_e$. The probability
measure covered by each ellipsoid is $P_S(p_e)$, which is independent
of $\hat\theta_Y$ and provides a reduction of $-\log P_S(p_e)$ in the
redundancy. Further simplification of $P_S(p_e)$ and the fact that
$\delta_d(p_e) \approx \frac{d}{2} \log e + \log\frac{1}{p_e}$ will lead to the desired result.

VII. Conclusion 

In this paper, we introduced and studied the problem of
Universal Compression of Distributed Identical Sources (UC-DIS),
which is a more favorable framework as compared to
the Slepian-Wolf (SW) framework in several applications, such
as the compression of data from mirrors of a data server.
In UC-DIS, the correlation among outputs of the sources
is due to the finite-length universal compression constraint,
departing from the nature of the correlation in the SW
framework. For UC-DIS, involving two identical sources, we
introduced the DUCompM coding strategy (compression using
the side information at the decoder when the two encoders
do not communicate) and obtained an upper bound on its
average minimax redundancy. We demonstrated that for
finite-length sequences with reasonable permissible error probability,
the DUCompM coding strategy by far outperforms traditional
universal compression, which justifies the usefulness of the
DUCompM coding strategy in practice.